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ABSTRACT 

Various  iterative  methods  for  solving  the  linear  systems  associated  with  fin- 
ite element  approximations  to  self-adjoint  elliptic  differential  operators  are 
compared  based  on  their  performance  on  serial  and  parallel  machines.  The 
methods  studied  are  all  preconditioned  conjugate  gradient  methods,  differing 
only  in  the  choice  of  preconditioner.  The  preconditioners  considered  arise 
from  diagonal  scaling,  incomplete  Cholesky  decomposition,  hierarchical 
basis  functions,  and  a  Neumann-Dirichlet  domain  decomposition  technique. 
The  hierarchical  basis  function  idea  is  shown  to  be  especially  effective  on 
both  serial  and  parallel  architectures. 

1.   Introduction 

The  preconditioned  conjugate  gradient  method  [9,4]  has  proved  to  be  an  effective  itera- 
tive procedure  for  solving  the  linear  systems  arising  from  finite  element  and  finite  difference 
approximations  to  self-adjoint  elliptic  partial  differential  equations.  Recent  effort  has 
focussed  on  developing  efficient  preconditioners  for  such  problems,  to  insure  rapid  conver- 
gence of  the  iteration. 

In  use  for  some  time  have  been  diagonal  scaling  [4]  and  incomplete  Cholesky  [12] 
preconditioners.  The  application  of  these  preconditioners  requires  knowledge  only  of  the 
linear  system  to  be  solved,  not  of  the  partial  differential  equation  or  differencing  technique 
used.  When  applied  to  problems  arising  from  the  discretization  of  a  partial  differential 
operator  on  some  mesh,  however,  these  methods  slow  down  considerably  as  the  mesh  spac- 
ing becomes  small. 

In  recent  years,  preconditioners  have  been  developed  that  insure  convergence  of  the 
conjugate  gradient  method  at  a  rate  that  is  independent  of,  or  only  mildly  dependent  on,  the 
mesh  spacing.  Such  preconditioners  usually  require  some  knowledge  of  the  partial  differen- 
tial equation  and/or  the  mesh  and  differencing  technique  used,  and  so  they  are  less  general 
than  diagonal  scaling  and  incomplete  Cholesky  decomposition.  For  fine  enough  meshes, 
however,  they  are  sure  to  outperform  these  other  preconditioners  in  terms  of  number  of 
iterations  required  to  achieve  convergence.  Two  such  preconditioning  ideas  are  considered 
here.  One  arises  from  using  hierarchical  basis  functions  [16],  instead  of  the  usual  nodal 
basis  functions,  to  represent  the  finite  element  space.  The  other  is  an  iterative  substructur- 
ing  technique  known  as  Neumann-Dirichlet  domain  decomposition  [6],  which  requires  divid- 
ing the  domain  of  the  problem  into  pieces  and  at  each  iteration  solving  separately  on  the  dif- 
ferent pieces.  While  it  can  be  shown  that  for  a  wide  class  of  finite  element  approximations 
to  self-adjoint  elliptic  partial  differential  equations  the  number  of  iterations  required  by 
these  methods  is  bounded  by  0(Iog  h~^),  the  constant  multiplying  the  log  h~^  factor  is  prob- 
lem dependent,  and  so  numerical  experiments  are  needed  to  determine  the  suitability  of  the 
methods  for  solving  particular  problem  types. 


All  of  these  methods  offer  some  opportunity  for  exploiting  parallelism.  In  fact,  for  a 
fixed  number  of  processors  P,  if  the  problem  size  is  large  enough,  then  each  of  these 
methods  can  achieve  a  full  factor  of  P  speedup  over  its  serial  execution  time.  The  only 
differences  lie  in  the  size  and  serial  execution  time  of  problem  needed  to  achieve  this  factor 
of  P  speedup;  i.e.,  in  the  "granularity"  of  parallel  tasks  compared  to  the  total  work  per- 
formed. 

In  the  following  section  the  algorithms  are  briefly  described,  with  references  to  the 
relevant  literature  for  proofs  of  convergence  rates,  etc.  Next  the  different  parts  of  each 
method  are  analyzed  with  regard  to  number  of  operations  required  and  amount  of  parallel- 
ism present.  Asymptotic  estimates  of  serial  and  parallel  work  are  given.  Finally,  results  of 
numerical  tests  performed  on  the  NYU  Ultracomputer  Prototype  are  reported. 

2.   The  Algorithms. 

As  mentioned  previously,  the  algorithms  described  here  are  all  preconditioned  conju- 
gate gradient  methods  [9,4],  differing  only  in  the  choice  of  preconditioner.  When  the  conju- 
gate gradient  method  is  employed  to  solve  a  symmetric  positive  definite  linear  system 
Ax  =  b,  its  convergence  rate  can  be  estimated  in  terms  of  the  square  root  of  the  condition 
number  of  the  coefficient  matrix  A.  If  the  matrix  A  is  well-conditioned,  convergence  will  be 
rapid.  If  A  is  not  well-conditioned,  one  often  uses  a  "preconditioning"  matrix  M,  having 
the  property  that  (1)  linear  systems  with  coefficient  matrix  M  are  relatively  easy  to  solve, 
and  (2)  the  matrix  M~^A  is  well-conditioned.  One  then  solves  a  linear  system  with  coeffi- 
cient matrix  M  at  each  iteration  of  the  conjugate  gradient  algorithm,  and  the  number  of 
iterations  required  depends  on  the  square  root  of  the  condition  number  of  M~^A.  The  con- 
jugate gradient  algorithm  with  preconditioner  M  can  be  stated  as  follows: 

Given  an  initial  guess  x^,  compute  r^  =  b—Ax^, 
and  solve  for  z°  the  linear  system  Mz^  =  r^. 
SetpO  =  zO.    Compute  ao  =  zO'rO.   Fork=l,2 

Compute  /4/7*"^ 

Take   j:*  =  x''~^  +  a^-ip'''^. 

Compute   r*  =  r*~^  —  <3jt_iAp*"^ 

Solve  for  2*  the  linear  system   A/z*  =  r*. 

Check  for  convergence.   If  convergence  criterion  satisfied,  then  quit.   Else, 

Set  Ok  =  z*V*   and    b^  =  -: . 

Compute  p*  =  z*^  +  bkP^~'^. 
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The  simplest  preconditioner  to  use  in  the  conjugate  gradient  algorithm  is  the  diagonal 
of  A,  M  =  diag{A).  Of  all  diagonal  preconditioning  matrices,  it  can  be  shown  [1]  that  this 
is  nearly  optimal  for  minimizing  the  condition  number  of  M~^A.  We  will  refer  to  the  conju- 
gate gradient  method  with  this  diagonal  scaling  preconditioner  as  DSCG  (diagonally  scaled 
conjugate  gradients). 

A  more  sophisticated  preconditioner  is  obtained  by  performing  an  "incomplete  Chole- 
sky  fartorization"  [12]  of  the  matrix  A.  Here  one  factors  the  matrix  A  approximately  into 
the  product  of  a  lower  triangular  matrix  and  its  transpose,  where  the  triangular  factors  have 
nonzero  entries  only  in  positions  where  the  corresponding  triangular  parts  of  A  have 
nonzeros.  The  nonzero  entries  are  chosen  so  that  the  product  of  the  incomplete  Cholesky 
factors  matches  A  in  places  where  A  has  nonzeros,  but  this  product  also  has  some  nonzero 
elements  in  places  where  A  has  zeros.  The  form  of  the  incomplete  Cholesky  factors  and 
their  product,  for  a  nine-striped  matrix  A,  such  as  that  arising  from  a  piecewise  bilinear  fin- 
ite element  approximation,  is  pictured  below: 


K 


The   conjugate   gradient   algorithm   with  the   incomplete   Cholesky  preconditioner   will  be 
referred  to  as  ICCG  (incomplete  Cholesky  conjugate  gradients). 

To  describe  the  other  two  preconditioners,  we  must  first  consider  the  nature  of  a  finite 
element  approximation  for  a  self-adjoint  elliptic  partial  differential  equation 

Lu  =  f 

defined  on  a  domain  O,  and  satisfying  boundary  conditions,  say, 

u  =  0   on  Fi,     u„  =  0   on  r2, 

u„  =  normal  derivative  of  w,      Tj  (J  Ta  =  F  =  boundary  of  Cl  . 

An  approximate  solution  u  for  such  a  problem  can  be  chosen  from  a  finite  dimensional  space 
S„  in  such  a  way  that  the  "energy"  norm  of  the  error 
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<L(u  —  u),  u  —  u> 

is  minimized,  or,  equivalently,  in  such  a  way  that  the  residual  is  orthogonal  to  every  element 
of5„: 

<L(u-u),  v>  =  0,      V  ve5„. 

[Here  the  symbol  <.,.>  denotes  the  usual  Lj  inner  product,  after  integrating  by  parts  or 
applying  Green's  theorem  to  eliminate  high  order  derivatives.] 

To  find  this  approximate  solution,  one  uses  a  "basis"  4)i,  <{>2,  ...,<})„  for  the  space  S„ 
and  writes  u  as  a  linear  combination  of  the  basis  functions: 

uix)  =  J  cxj<i>jix)  . 

The  coefficients  aj,  .  .  .  .a^  are  chosen  so  that  u  satisfies 

<Liu-u),  4>,>  =  </,4>,>  -  J  a;  <L4)j,  4),>  =  0  .     i=l,....n. 


or,  in  matrix  form, 


Aa  =  f ,      a  = 


"1 


ot. 


.      /  = 


</.4>i> 

</,4)„> 


Aij  =  <L4>^,4),>  ,     i,j=l,...,n. 

Ordinarily,  the  basis  functions  <i>i,  .  .  .  ,<^„  are  chosen  to  be  nonzero  over  just  a  few 
intervals  of  the  mesh.  The  matrix  A  is  then  quite  sparse,  having  zero  entries  in  positions  i,j 
such  that  <{),•  and  4);  are  nonzero  over  disjoint  intervals.  For  example,  in  one  dimension,  a 
typical  choice  for  5„  is  the  space  of  continuous  piecewise  linear  functions  satisfying  the 

essential  boundary  conditions,  and  the  basis  functions  ct>i ^„  are  taken  to  be  one  at  one 

node  and  zero  at  all  others,  as  pictured  below: 

/X\-  •  -xx 


In  two  dimensions,  it  is  common  to  use  triangular  elements  with  piecewise  linear  approxima- 
tion funaions,  or  rectangular  elements  and  piecewise  bilinear  approximations.  The  usual 
"nodal"  basis  functions  are  aeain  nonzero  at  one  node  and  zero  at  all  others. 
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When  the  matrix  A  is  formed  from  the  energy  inner  products  of  such  basis  functions 
and  the  operator  L  is  a  second-order  elliptic  differential  operator  with  appropriate  boundary 
conditions,  it  can  be  shown  that  the  condition  number  of  the  resulting  linear  system  is 
0(h~'^),  where  h  is  the  minimum  mesh  spacing  of  the  regular  grid.  Thus,  the  conjugate  gra- 
dient algorithm  applied  to  such  a  linear  system,  with  no  preconditioner,  requires  0(h~^) 
iterations  to  achieve  a  desired  fixed  level  of  accuracy. 

The  same  finite  element  approximation  can  be  obtained  using  different  basis  functions. 
If  the  usual  nodal  basis  functions  are  replaced  by  "hierarchical"  basis  functions  [16],  it  can 
be  shown  that,  in  two  dimensions,  the  condition  number  of  the  resulting  matrix  A  is  of  order 
0(\og  h~^)^,  and  so  the  conjugate  gradient  algorithm  can  achieve  a  desired  fixed  level  of 
accuracy  in  0(log  h~^)  iterations.  Hierarchical  basis  functions  consist  of  the  usual  nodal  basis 
funaion(s)  on  a  coarsest  grid,  together  with  the  nodal  basis  functions  for  a  finer  grid 
corresponding  to  nodes  that  are  not  present  in  the  coarsest  grid,  together  with  the  nodal 
basis  functions  for  a  still  finer  grid  corresponding  to  nodes  that  are  not  present  in  any  of  the 
coarser  levels,  etc.,  up  to  the  finest  grid  on  which  it  is  desired  to  solve  the  problem.  The 
hierarchical  basis  functions  for  the  one-dimensional  space  of  continuous  piecewise  linear 
functions  are  pictured  below: 


^1^2 


^, 


In  two  dimensions,  with  triangular  or  rectangular  grids,  the  picture  is  similar. 

The  linear  system  arising  from  the  hierarchical  basis  functions,  »l/i,  .  .  .  .\\)„,  is  given  by 


AP=/.      p  = 


P. 


/  = 


</.  4'i> 


<f/^r,> 


Aij  =  <Ly\ij,\\ii>,     i,j=\,...,n, 
where  the  approximate  solution  u{x)  has  been  written  in  the  form 

The  matrix  A  arising  from  the  hierarchical  basis  is  much  less  sparse  than  the  matrix  A  associ- 
ated with  the  nodal  basis  representation.  Fortunately,  however,  it  i.s  not  necessary  to  actu- 
ally compute  the  matrix  A.    We  need  only  be  able  to  multiply  A  times  a  vector  in  order  to 
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solve  the  linear  system  using  the  conjugate  gradient  algorithm. 

Let  S  be  the  linear  transformation  which  maps  coefficient  vectors  p  of  the  hierarchical 
basis  representation  of  a  function  in  S„  into  the  corresponding  coefficient  vector  a  of  the 
nodal  basis  representation  of  the  function.  Then  it  is  not  hard  to  see  [16]  that  the  matrix  A 
is  given  by 

A  =  S'^AS  , 

where  A  is  the  matrix  associated  with  the  nodal  basis  functions.  Thus,  to  multiply  A  times  a 
vector,  we  first  transform  that  vector  to  the  nodal  basis  representation,  then  multiply  by  A, 
then  perform  the  transpose  of  the  transformation. 

It  is  also  easy  to  see  that  applying  the  unpreconditioned  conjugate  gradient  algorithm  to 
the  matrix  A=S'^AS  is  equivalent  to  applying  the  preconditioned  conjugate  gradient  algo- 
rithm to  the  matrix  A,  with  preconditioner  M  satisfying 

A/-1  =  557- 

By  applying  the  algorithm  in  this  form,  we  avoid  explicitly  translating  from  nodal  to 
hierarchical  bases  in  the  beginning  and  translating  back  at  the  end.  The  conjugate  gradient 
algorithm  with  this  hierarchical  basis  preconditioner  will  be  referred  to  as  HBCG  (hierarchi- 
cal basis  conjugate  gradients). 

Note  that  the  preconditioner  M  in  no  way  depends  on  the  differential  operator  L!  It  is 
entirely  determined  by  the  finite  element  space  S„  and  by  the  nodal  and  hierarchical  basis 
functions  for  5„.  While  Yserentant  showed  [16]  that  the  condition  number  of  M~^A  is 
0{log  h~^)^  for  all  reasonably  smooth  self-adjoint  elliptic  differential  operators  L,  one 
would  expect  that  if  M  is  a  good  preconditioner  for  one  operator  it  would  not  be  a  good 
preconditioner  for  a  very  different  differential  operator.  This  is  borne  out  in  the  numerical 
examples  of  section  4.  By  considering  diffusion-type  operators  with  strongly  varying  diffu- 
sion coefficients,  one  can  construct  matrices  for  which  this  preconditioner  is  arbitrarily  bad 
(though  still  dependent  only  logarithmically  on  the  mesh  size). 

To  avoid  this  problem,  we  used  a  diagonal  scaling  on  the  matrix  A=S^AS.  According 
to  Bauer's  theorem  [1],  a  nearly  optimal  diagonal  preconditioner  to  use  is  D  =  diag(S^AS). 
This  diagonal  matrix  requires  additional  time  to  compute,  however,  and  we  found  that  we 
obtained  almost  identical  results  using  D  =  diag(A)  as  the  preconditioner  for  A.  Thus,  the 
unpreconditioned  conjugate  gradient  algorithm  was  applied  to  the  matrix 

D-'^AD-'^  ,      D  =  diag(A)  ,  — 

or,  equivalently,  the  preconditioned  conjugate  gradient  method  was  used  to  solve  the  linear 
system  Ax  =  b,  with  preconditioner  M  satisfying 

This  algorithm  will  be  referred  to  as  HBDSCG  (hierarchical  basis  diagonally  scaled 
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conjugate  gradients). 

The  final  algorithm  tried  is  an  iterative  substructuring  technique  known  as  Neumann- 
Dirichlet  domain  decomposition  [6].  This  idea  requires  dividing  the  domain  of  the  problem 
into  "boxes",  as  pictured  below: 


r<    t>     v^     o     r^ 
D     H      D      H-     ^ 


Some  of  the  boxes  are  labelled  "D"  for  Dirichlet  subregions  and  some  are  labelled  "N"  for 
Neumann  subregions.  D-boxes  may  share  a  side  only  with  N-boxes,  and  vice  versa.  Diri- 
chlet points  (mesh  points  in  the  interior  of  D-boxes)  are  numbered  first,  and  Neumann 
points  (mesh  points  in  the  interior  and  on  the  internal  or  free  boundaries  of  N-boxes)  are 
numbered  last.  With  this  ordering  of  unknowns,  using  the  usual  nodal  basis  representation 
of  the  finite  element  space,  the  matrix  A  takes  the  form 


A  = 


Ah  A22 


y         \ 

y         x 

x^ 

^1 

X2 

b2 

v         / 

\         J 

where  the  entries  of  A.\\  represent  couplings  between  Dirichlet  points,  the  entries  of  kii  cou- 
plings between  Neumann  points,  and  the  entries  of /4i2  couplings  between  Diriclilct  and  Neu- 
mann points. 

Formally  solving  the  linear  system 

-^n  An 
A\2  A22 

using  Gaussian  elimination,  one  writes 

(A 22  -  A [2AriU  12)^:2  =  b2-  A^An^bi  , 

Auxi  =  bi  -  A12X2  , 

and  solves  the  Schur  complement  matrix, 

C  =  A22  -  AhAn^An  , 

to  obtain  X2  and  then  substitutes  this  into  the  original  equation  to  obtain  xi.  The  matrix  C  is 
usually  dense  and  quite  costly  to  compute  and  store.    Fortunately,  however,  the  matrix  C 


Ultracomputer  Note  126 


Page  7 


need  not  be  computed  in  order  to  solve  with  the  conjugate  gradient  algorithm.  We  need 
only  be  able  to  multiply  C  times  a  vector.  This  requires  multiplication  by  the  submatrices 
Ai2<  ^2,  and  A22.  and  solving  of  a  problem  on  the  Dirichlet  region.  The  D-boxes  are  uncou- 
pled, and  so  An  is  a  block  diagonal  matrix,  with  each  block  corresponding  to  one  of  the  D- 
boxes  of  the  domain.   The  individual  blocks  can  be  solved  independently  and  in  parallel. 

Unfortunately,  the  matrix  C  is  not  well-conditioned,  as  argued  in  [3].  Hence  a  precon- 
ditioner  is  needed  to  insure  rapid  convergence  of  the  conjugate  gradient  iteration.  The 
preconditioner  proposed  in  [6]  is  the  matrix  A^)  resulting  from  a  finite  element  approxima- 
tion (with  the  usual  nodal  basis  functions)  for  a  problem  defined  only  on  the  Neumann  part 
of  the  region: 


^  Hs 


It  is  shown  in  [14]  that  the  condition  number  of  the  matrix 

A{^y'  C 

is  of  order    0(]og  (t"))^.    where  h  is  the  minimum  mesh  width  and  c!  is  the  maximum  N-  or 

D-box  width. 

Different  N-boxes  couple  at  cross  points.    Hence,  to  solve  aW  we  number  these  cross 
points  last  and  write 

where  B22  represents  couplings  between  points  of  the  Neumann  region,  other  than  cross 
points,  and  has  a  block  diagonal  structure  with  one  block  for  each  N-box  of  the  region.  To 
solve  the  linear  system 


S22 

B23 

Bh 

B33 

\ 

/ 

B22    S23 

Bh   S33 

f        \ 

22 
23 

= 

r2 
^■3 

we  actually  form  the  Schur  complement 
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S  =  S33  -  Bl3B22^B22 

and  factor  this  banded  matrix.  At  each  iteration  of  the  conjugate  gradient  algorithm,  the 
factored  matrix  is  then  used  to  compute  23,  and  this  is  substituted  into  the  original  system  to 
obtain  Z2  by  solving  the  block  diagonal  matrix  522-  The  blocks  of  B22  can  be  solved 
independently  and  in  parallel.  This  Neumann-Dirichlet  preconditioner  used  with  the  conju- 
gate gradient  algorithm  will  be  referred  to  as  NDCG  (Neumann-Dirichlet  conjugate  gra- 
dients). 

3.   Operation  Counts  and  Asymptotic  Estimates. 

A  single  iteration  of  the  preconditioned  conjugate  gradient  algorithm  given  in  section  2 
requires  the  following  operations: 

1    matrix  A  times  a  vector   (ATIMES) 

1  solution  of  a  linear  system  with  coefficient  matrix  M    (MSOLVE) 

2  vector  inner  products   (SDOT) 

1     check  for  convergence,  usually  computing  the  2-norm  of  a  vector   (SNRM2) 

3  scalar  times  a  vector  plus  a  vector   (SAXPY) 

If  A  is  a  nine-striped  matrix,  as  in  the  examples  of  section  4,  then  the  work  per  iteration, 
excluding  the  MSOLVE  operation,  is  approximately 

1  9n  multiplies,  8n  additions   (ATIMES) 

2  n  multiplies,  n  additions      (SDOT) 

1     n  multiplies,  n  additions      (SNRM2) 

3  n  multiplies,  n  additions      (SAXPY) 


29n  floating  point  operations      (TOTAL) 


With  n  processors,  ignoring  synchronization  and  communication  time,  the  ATIMES  and 
SAXPY  operations  could  be  performed  in  time  0(1),  while  the  SDOT  and  SNRM2  opera- 
tions would  require  time  0(log  n).  For  a  fixed  number  of  processors  P ,  however,  with 
P  «  n,  the  execution  time  of  each  of  these  operations  will  be  approximately  VP  times  its 
execution  time  on  a  single  processor.  Because  of  the  processor  synchronization  required 
between  operations,  however,  and  because  of  the  very  simple  nature  of  the  operations  per- 
formed, n  may  have  to  be  very  large  compared  to  P  in  order  to  achieve  this  factor  of  P 
speedup. 

Although  the  algorithms  considered  in  this  paper  use  very  different  sorts  of  MSOLVE 
operations,  it  will  be  argued  later  that  for  large  enough  n  and  a  fixed  number  of  processors 
P ,  each  MSOLVE  operation  can  also  attain  almost  a  full  factor  of  P  speedup,  though  some 
will  require  larger  values  of  n  than  others.  While  a  single  iteration  of  the  conjugate  gradient 
algorithm  can  thus  be  parallelized,  the  different  iteration  steps  must  occur  sequentially.  One 
processor  might  begin  the  work  of  iteration  step  k  while  another  is  still  working  on  step  k-1, 
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but  iteration  k  cannot  be  completed  until  after  iteration  k-1  is  completed.  Thus,  the  most 
important  property  of  a  preconditioned  conjugate  gradient  method,  for  good  performance 
on  parallel  machines,  is  that  it  requires  few  iterations  to  converge.  Additional  work  in  the 
M SOLVE  operation  at  each  iteration  may  be  acceptable  if  it  is  parallelizable  and  if  it  signifi- 
cantly reduces  the  number  of  iterations  required. 

In  the  very  simple  DSCG  method,  M  is  taken  to  be  diag^A),  and  the  MSOLVE  opera- 
tion requires  n  floating  point  divides  (or  n  multiplies  if  diagiA)~^  is  precomputed).  This 
operation  is  easily  parallelized,  and  if  n  processors  were  used,  the  theoretical  execution  time 
would  be  0(1).  In  praaice,  for  a  fixed  number  of  processors  P,  if  n  is  large  enough,  then  a 
full  factor  of  P  speedup  can  be  attained  in  this  operation. 

The  MSOLVE  operation  in  the  ICCG  method  is  more  difficult  to  parallelize.  This 
requires  solving  a  sparse  triangular  linear  system.  If  enough  processors  are  available,  O(n^), 
then  this  operation  can  theoretically  be  performed  in  time  0(]og^n)  [13].  The  algorithm  for 
performing  this  banded  triangular  solve,  however,  requires  more  total  operations  than  the 
usual  method  of  backsolving  and  so  is  not  practical  when  the  number  of  processors  is  small 
compared  to  n.  The  standard  method  of  backsolving  for  a  sparse  triangular  linear  system, 
such  as  that  arising  from  a  nine-point  finite  element  approximation,  can  be  partially  parallel- 
ized. If  the  domain  of  the  problem  is  a  rectangle  and  if  the  natural  ordering  of  nodes  is 
used,  then   solution  components  can  be  computed  in  the  order  pictured  below: 


l« 

l«^ 

V 

^ 

n 

»^ 

vi 

"^^ 

^ 
^ 

'^^ 

^ 

u 

xi 

l«K. 

x\ 

k,^^ 

k 

^         ^ 

^ 

^: — ^ 

% 

<\ 

• 

L 

k           1 

1 

^ 

5       ' 

►        • 

Values  associated  with  the  nodes  along  each  (semi-)  diagonal  of  the  rectangle  can  be  solved 
for  simultaneously.  If  ^n  12  processors  are  available,  then  the  theoretical  execution  time 
for  this  backsolve  is  the  number  of  diagonals,  OiSn).  If  the  number  of  processors  P  is 
much  less  than  ^ n  II,  then  it  can  be  shown  as  in  [8]  that  almost  a  full  factor  of  P  speedup 
can  be  attained  in  this  operation. 
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The  MSOLVE  operation  in  the  HBCG  and  HBDSCG  methods  requires  translating  from 
the  hierarchical  basis  functions  to  the  nodal  basis  functions.  This  operation  can  be  per- 
formed recursively,  as  described  in  [16].  First,  nodal  basis  coefficients  are  computed  at  the 
nodes  of  the  coarsest  grid.  Values  at  the  nodes  of  the  next  finer  grid  are  then  taken  as  aver- 
ages of  the  already  computed  surrounding  nodal  values.  On  the  next  finer  grid,  coefficients 
are  again  computed  as  averages  of  the  surrounding  nodal  values,  which  have  already  been 
computed,  etc.  The  transpose  of  this  operation  is  performed  similarly,  only  now  one  starts 
with  the  nodes  that  belong  to  the  finest  grid  only  and  works  down  to  the  nodes  of  the  coar- 
sest grid.  If  n  processors  are  available,  then  the  time  to  perform  this  translation  is  the 
number  of  grid  levels,  Oilog  n).  If  the  number  of  processors  P  is  much  less  than  n,  then 
although  not  all  processors  can  be  kept  busy  all  of  the  time,  the  percentage  of  time  spent  on 
grids  with  fewer  than  P  points  is  small,  and  almost  a  full  factor  of  P  speedup  can  be  attained 
in  this  operation,  as  well.  It  should  be  noted  that  the  process  of  translating  from  hierarchical 
to  nodal  basis  functions  involves  recursively  working  on  each  grid  level,  much  as  in  the  mul- 
tigrid  method.  There  is  less  work  to  do  on  each  grid,  however,  as  there  is  no  relaxation  step 
but  only  an  interpolation-like  operation  to  perform.  Thus,  a  single  iteration  of  the  hierarchi- 
cal basis  conjugate  gradient  method  should  parallelize  somewhat  more  efficiently  than  a  sin- 
gle multigrid  cycle.  The  disadvantage  is  that  the  hierarchical  basis  method  requires  more 
iterations  (cycles)  than  the  multigrid  method  -  0{,log  h~^)  as  opposed  to  0(1). 

The  final  algorithm  considered,  NDCG,  uses  an  MSOLVE  routine  that  requires:  (1) 
solving  a  banded  linear  system  with  coefficient  matrix  B  =  B33  —  fiJsSia^Bis  of  order 
equal  to  the  number  of  crosspoints  between  N-boxes  and  of  half-bandwidth  equal  to  the 
number  of  crosspoints  in  a  row  of  the  grid,  (2)  multiplying  by  the  matrix  B23,  which  couples 
Neumann  points  other  than  crosspoints  to  the  crosspoints,  and  (3)  solving  the  block  diagonal 
matrix  S22.  where  each  block  represents  a  Neumann  problem  on  one  of  the  N-boxes. 
Because  the  conjugate  gradient  algorithm  is  applied  to  the  matrix  C  =  A22  ~  >4^2'4n^Ai2  in 
this  case,  rather  than  to  the  matrix  A,  the  "ATIMES"  routine  is  also  different  from  that 
used  with  the  other  methods.  It  requires  multiplying  by  the  submatrices  A12,  AI2,  and  A22 
and  solving  the  block  diagonal  matrix  An,  where  each  block  represents  a  Dirichlet  problem 
on  one  of  the  D-boxes. 

Depending  on  the  nature  of  the  problem  being  solved,  the  MSOLVE  and  ATIMES 
operations  might  be  carried  out  in  several  different  ways.  If  the  differential  operator  is 
equal  to  a  constant  times  the  Laplacian  on  each  box  and  if  the  grid  is  uniform,  then  fast 
Poisson  solvers  can  be  used  on  the  Dirichlet  and  Neumann  subregions  (though  it  is  necessary 
to  use  some  sort  of  embedding  or  Lagrange  multiplier  technique  [15]  on  the  Neumann  subre- 
gions, since  they  do  not  contain  the  corner  crosspoints).  The  work  required  for  the  fast 
solvers  is  approximately  10  m  log2m  floating  point  operations,  where  m~{d/h)^  is  the 
number  of  points  in  a  single  D-  or  N-box. 
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The  assumption  of  a  Laplace  operator  on  each  box  was  unsuitable  for  some  of  the 
problems  we  considered,  and  so  we  solved  the  equations  on  each  box  using  Gaussian  elimi- 
nation and  a  band  matrix  solver.  Each  diagonal  block  of  An  and  B22  was  factored  prior  to 
starting  the  conjugate  gradient  iteration,  and  then  at  each  iteration  the  triangular  factors 
were  solved.  The  number  of  floating  point  operations  required  for  each  backsolve  is 
approximately  Am^^.   The  total  work  involved  in  the  MSOLVE  operation  is  then: 

Solve  band  matrix  B:   4d~^ 

Multiply  by  B23:    I2d~'^ 

Solve  block  diagonal  matrix  B22'   2n(d/h). 

The  ATIMES  operation  for  NDCG  requires: 

Solve  block  diagonal  matrix  An:    2n(d/h) 
Multiply  by  A12  and /4I2:   24^/d 
Multiply  by  A22:    nAi/2. 

Note  that  if  d/h  is  kept  fixed  as  h-^,  then  the  order  and  bandwidth  of  the  matrix  B 
differ  by  only  a  constant  factor  from  those  of  the  original  matrix  A.  Hence,  factoring  B 
before  the  iteration  requires  Oin^)  operations,  just  as  are  required  to  factor  the  matrix  A 
directly.  At  each  iteration,  the  backsolve  on  the  banded  triangular  factors  requires  Oin^^) 
operations.  On  the  other  hand,  if  d  is  kept  fixed  as  h^,  then  the  order  and  bandwidth  of 
the  blocks  associated  with  each  D-  and  N-box  differ  by  only  a  constant  factor  from  those  of 
A.  Hence,  again,  0(«^)  operations  are  required  to  factor  the  blocks  and  C>(n^'^)  operations 
must  be  performed  at  each  iteration  to  backsolve  the  banded  triangular  factors.  Thus, 
asymptotically,  the  operation  count  for  the  NDCG  method  is  of  the  same  order  as  that  for 
direct  Gaussian  elimination,  unless  fast  solvers  or  some  other  special  solution  technique  can 
be  used  on  the  subproblems. 

Table  1  summarizes  the  estimated  iteration  counts,  work  per  iteration,  and  parallelism 
present  in  the  implementations  used  for  these  different  iterative  methods. 
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Method 

#  Iterations 

work/iteration 

time/iteration 
with  Oin)  processors 

time/iteration 
with  P  «  n  processors 

DSCG 

0{h-^) 

30  n 

Oilog  n) 

herial 
P 

ICCG 

0{h-^) 

46  n 

O(V^) 

P  «^  11:    -^ 

Oilog  /2-1) 

45  n 

Oilog  n) 

^serial 
P 

HBDSCG 

0{log  /j-i) 

46  n 

Oilog  n) 

^serial 
P 

NDCG 

Oilog  f) 

d/h  fixed,  h^: 

4(d/h)^  n^^ 
d  fixed,  h^: 

Adn'^'^ 

time  for  banded 

triangular  solve 

with  bandwidth  Oi^n  )  , 

size  Oin) 

r,   _        1            ^serial 
Id^-          P 

Table  1.   Work  and  Parallelism  in  Different  Iterative  Methods. 

4.   Numerical  Results. 

The  problem  considered  is  a  time-independent  version  of  the  diffusion  equation  defined 
on  the  unit  square  with  Dirichlet  boundary  conditions: 

V  ■  pV  u    =   f       on  (0,1)  X  (0,1) 

uix,0)  =  uix,\)  =  uiO,y)  =  uil,y)  =  0. 

Expressions  for  the  diffusion  coefficient  pix,y)  ranged  from  a  constant  to  a  mildly  varying 
continuous  function  to  a  strongly  varying  piecewise  constant.  The  right-hand  side /(at,)')  was 
taken  to  be 

fix,y)  =  -2>-(l->')  -  2x(l-;c), 

so  that  for  p=l  the  solution  was  uix,y)=xil—x)yil—y).  The  methods  described  previously 
were  used  to  solve  the  linear  systems  resulting  from  continuous  piecewise  bilinear  finite  ele- 
ment approximations  for  this  problem  on  various  rectangular  grids.  Results  reported  are  for 
uniform  grids,  though  similar  results  were  obtained  using  graded  meshes.  The  matrix  A 
arising  from  such  a  finite  element  approximation  (with  the  usual  nodal  basis  functions  and 
the  natural  ordering  of  nodes)  is  a  nine-striped  matrix,  as  pictured  in  section  2. 

Table  2  shows  the  number  of  iterations  required  by  each  of  the  methods  to  achieve  a 
relative  error  of  size  10"''  times  that  of  the  relative  error  in  the  initial  guess: 


V  -  X\\2 

V  -  X\\2 


10" 


The  "exact"  solution  x  was  obtained  by  iterating  for  many  steps,  until  the  pseudo-residual, 
M~^ib—Ax),  was  of  size  10"^  or  less.   In  each  case,  a  random  initial  guess  was  used. 


Ultracomputer  Note  126 


Page  13 


Note  that,  as  predicted  by  the  estimates  in  Table  1,  the  number  of  iterations  required  by 
the  DSCG  and  ICCG  methods  increases  more  rapidly  with  grid  size  than  does  that  required 
by  the  other  methods.  Note  also,  however,  that  while  the  convergence  rate  of  DSCG, 
ICCG,  and  HBDSCG  is  almost  independent  of  the  diffusion  coefficient  p(j:,>'),  the  other 
methods  —  HBCG  and  NDCG  --  slow  down  considerably  when  applied  to  problems  with 
discontinuous,  strongly  varying  diffusion  coefficient.  This  behavior  cannot  be  predicted 
from  the  asymptotic  estimates  of  Table  1.  The  degradation  in  performance  of  these  methods 
makes  them  impractical  for  the  types  of  problems  described  in  [10],  for  example,  despite  the 
mild  dependence  of  the  convergence  rate  on  the  mesh  size. 

The  estimates  of  Table  1  predict  convergence  of  the  NDCG  method  in  0(log  d/h)  itera- 
tions. Asymptotically,  this  is  correct,  though  the  iteration  counts  in  Table  2  seem  to  grow 
with  the  mesh  size  even  when  d/h  is  fixed.  This  is  because  the  grid  size  is  not  yet  fine 
enough  to  approach  the  asymptotic  bound.  For  a  128x128  grid  with  p(jr,>')  =  l,  NDCG  with 
d/h  =  S  required  15  iterations,  just  as  for  the  64x64  grid,  indicating  that  this  is  probably  close 
to  the  maximum  number  of  iterations  that  would  be  required  on  any  size  grid.  For  the 
discontinuous  diffusion  coefficient  problems,  the  size  of  grid  necessary  to  approach  the  con- 
stant bound  on  iteration  count  is  probably  much  finer,  and  the  constant  bound  much  larger 
than  the  number  of  iterations  listed  in  Table  2. 


P(x,y) 

grid  size 

#  Iterations                                                | 

DSCG 

ICCG 

HBCG 

HBDSCG 

NDCG 

(fs, 

NDCG 

1. 

16x16 

21 

8 

14 

14 

6 

32x32 

43 

13 

18 

18 

11 

7 

64x64 

84 

25 

21 

21 

15 

13 

if  j:  <  .5,    p  =  1. 

if  X  >  .5,    p  =  100. 

16x16 

23 

8 

88 

15 

8 

32x32 

49 

14 

»100 

19 

29 

9 

64x64 

>100 

28 

»100 

28 

41 

32 

if  j:<.5,y<.5,    p=l. 

ifx>:.5,>'<.5,    p=l.£:4 

if  j:<.5,>'>.5,    p=1.£-4 

if  jT^.S.y^.S,    p  =  l. 

16x16 

24 

9 

»100 

12 

7 

32x32 

48 

15 

»100 

14 

32 

9 

64x64 

97 

24 

»100 

15 

>100 

61 

.01  +  x^  +  y^ 

16x16 

22 

8 

51 

15 

7 

32x32 

43 

13 

76 

19 

11 

7 

64x64 

84 

25 

»ino 

23 

15 

n 

Table  2.   Number  of  Iterations  Required  with  Various  Grid  Sizes  and  Different  Functions  pix,y). 
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For  comparison,  we  include  in  Table  2b  another  set  of  tests  run  on  the  same  problems, 
but  this  time  with  a  zero  initial  guess.  Note  that  the  iteration  counts  in  Table  2b  are  consid- 
erably better  than  those  of  Table  2,  for  all  methods  tested,  with  some  methods  showing 
more  of  a  difference  than  others.  This  is  because  with  a  smooth  (and  symmetric)  right-hand 
side  and  a  zero  initial  guess,  the  initial  residual  is  entirely  low  frequency.  It  is  deficient  in 
many  of  the  eigenvectors  of  the  iteration  matrices  and  so  one  sees  faster  convergence  than 
could  be  prediaed  by  spectral  analysis.  While  smoothness  of  the  initial  residual  might  be  a 
reasonable  assumption  for  some  problems,  we  will  generally  report  results  with  the  random 
initial  guess,  to  better  illustrate  the  theory.  It  should  be  pointed  out  that  many  of  the  results 
reported  in  the  literature  [e.g.,  6,11]  use  a  zero  initial  guess  with  a  similar  right-hand  side 
function,  and  this  should  be  taken  into  account  by  the  reader,  since  it  can  significantly  affect 
iteration  counts  and  comparisons. 


p(^.y) 

grid  size 

#  Iterations 

DSCG 

ICCG 

HBCG 

HBDSCG 

NDCG 
(f  =  8) 

NDCG 

1. 

16x16 

10 

6 

9 

9 

2 

32x32 

19 

9 

11 

11 

8 

2 

64x64 

37 

16 

12 

12 

10 

9 

if  X  <  .5,    p  =  1. 
if  X  >  .5,    p  =  100. 

16x16 

18 

6 

65 

10 

7 

32x32 

35 

10 

86 

11 

21 

8 

64x64 

70 

20 

»100 

13 

30 

25 

if  x<.5,  y<.5.    p=l. 
ifx>.5,y<.5.    p=l.E4 
if  x<.5,y>.5,    p=l.E-4 
if  a:>.5,  y>.5.    p=l. 

16x16 

11 

8 

»100 

8 

5 

32x32 

22 

13 

»100 

10 

28 

5 

64x64 

43 

28 

»100 

12 

69 

33 

.01   +  x^  +  y^ 

16x16 

19 

7 

41 

11 

5 

32x32 

37 

12 

54 

12 

9 

5 

64x64 

74 

22 

59 

13 

11 

11 

Table  2b.    Number  of  Iterations  Required  with  Various  Grid  Sizes 
and  Different  Functions  p(j:,y),  using  Zero  Initial  Guess. 
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Table  2  shows  only  numbers  of  iterations,  but  as  seen  from  Table  1,  some  of  the 
methods  require  more  work  per  iteration  than  others.  In  Table  3,  timings  are  reported  for 
the  various  (serial)  codes  run  on  the  NYU  Ultracomputer  Prototype  [7].  The  timings  are  for 
the  case  p(jc,>')  =  l,  with  a  64x64  grid  and  a  random  initial  guess.  The  time  per  iteration 
determined  for  this  case  can  be  used  with  the  iteration  counts  of  Table  2  to  compute  relative 
performance  for  the  other  problems.  The  time  per  iteration  is  roughly  proportional  to  the 
work  per  iteration,  given  in  Table  1.  This  is  to  be  expected  on  a  single  processor  with  no 
vector  capabilities.  The  comparisons  of  Table  3  might  look  significantly  different,  however, 
on  a  vector  processor  such  as  the  Cray  1,  especially  if  all  or  some  sections  of  the  codes  were 
carefully  optimized  in  assembly  language.  In  particular,  the  DSCG  timings  would  look 
better  compared  to  ICCG  because  of  better  vectorization.  Also  shown  in  Table  3  is  a  com- 
parison of  the  iterative  methods  with  a  direct  band  matrix  solver,  SPBFA/SL,  from  LIN- 
PACK  [5].  It  is  seen  that  for  this  problem  all  of  the  iterative  methods  outperform  the  band 
solver  on  this  machine  though,  again,  the  comparison  might  look  different  on  a  vector 
machine. 

Timings  are  given  for  the  individual  parts  of  the  codes  described  in  section  3.  The 
setup  times  consists  of  the  time  to  invert  the  diagonal  of  A  in  the  DSCG  and  HBDSCG 
codes,  the  time  to  form  the  incomplete  Cholesky  decomposition  in  the  ICCG  code,  and  the 
time  to  factor  each  D—  and  A'—  block  and  the  Shur  complement  matrix  corresponding  to  the 
crosspoints  in  the  NDCG  code.  It  is  seen  that  most  parts  contribute  significantly  to  the  total 
time  required  by  the  various  methods.  Hence,  if  a  good  speedup  is  to  be  obtained  through 
parallelization,  it  will  be  necessary  to  parallelize  almost  all  sections  of  the  codes.  A  modi- 
fied code,  ICCG',  is  included  in  Table  3.  This  uses  the  same  algorithm  as  the  original  ICCG 
code  but  determines  the  order  in  which  solution  components  can  be  computed  to  take  advan- 
tage of  parallelism  in  the  MSOLVE  operation  (the  diagonal  order  described  in  section  2), 
and  then  solves  for  the  components  in  this  order.  (See  [8].)  The  original  ICCG  code  solves 
with  the  natural  ordering  and  is  slightly  more  efficient  as  a  serial  code.  It  is  the  ICCG'  code 
that  will  be  parallelized,  however,  and  speedups  reported  will  be  the  ratio  of  the  time  for 
ICCG'  with  1  processor  to  its  time  with  P  processors.  The  reader  should  keep  in  mind  that 
ICCG'  requires  a  small  amount  of  additional  overhead  in  the  setup  and  MSOLVE  sections 
and  so  will  represent  an  improvement  over  ICCG  only  if  the  MSOLVE  operation  parallel- 
izes well  enough  to  make  up  for  this  additional  overhead. 
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section 

DSCG 

ICCG 

ICCG' 

HBDSCG 

NDCG 
(f  =  8> 

NDCG 

time 

% 

time 

% 

time 

% 

time 

% 

time 

% 

time 

% 

setup 

22 

0 

1208 

5 

1912 

7 

22 

0 

2388 

8 

4957 

12 

ATIMES 

23341 

46 

7107 

27 

7202 

26 

6011 

29 

9114 

29 

11137 

28 

MSOLVE 

2161 

4 

10593 

40 

10804 

38 

8367 

40 

16954 

54 

21633 

54 

other 
(BLAS) 

24814 

49 

7517 

28 

8256 

29 

6363 

31 

2801 

9 

2206 

6 

TOTAL 

50338 

26425 

28174 

20763 

31257 

39933 

#  iters             84                       25                       25                       21                        15                       13 
time/iter          599                    1009                   1050                    988                    1925                   2690 

Direct  Solution 

SPBFA 

SPBSL 

Total 

65134 

3897 

69031 

Table  3.    Serial  Code  Timings    (p(j:,>')  =  1.,  64x64  grid,  random  initial  guess). 


The  codes  were  parallelized  in  a  very  simple  way.  All  arrays  and  a  few  frequently  used 
variables  were  declared  SHARED,  and  appropriate  DO  loops  were  replaced  by 
DOALL/ENDALL  constructs  [2],  so  that  different  iterates  could  be  executed  in  parallel. 
DOALL  prespawns  processes  and  assigns  work  to  them  when  it  is  available.  When  a 
DOALL  statement  is  encountered,  each  process  takes  an  index  from  the  loop  and  performs 
the  work  for  that  index  then  returns  to  the  queue  to  grab  another  index.  No  work  after  the 
ENDALL  statement  can  be  performed  until  the  work  associated  with  each  loop  index  has 
been  completed. 

Table  4  shows  the  parallel  code  timings  (for  the  64x64  grid,  with  p(j:,>')  =  l,  and  a  ran- 
dom initial  guess)  and  speedups  using  1,  2,  4,  6,  and  8  processors.  These  timings  are  plotted 
in  Fig.  1,  with  optimal  speedup  represented  by  the  dashed  lines.  Using  8  proces.<;ors,  most 
methods  obtained  about  a  factor  of  6  speedup  over  their  execution  time  on  a  single  proces- 
sor.    Note,   however,   that   the   time   for   the   parallel   codes  run   on   a   single  processor   is 
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somewhat  greater  (about  15-30%  greater)  than  the  time  for  the  serial  codes  shown  in  Table 
3.  This  is  due  to  the  added  overhead  of  accessing  shared  variables  and  taking  loop  indices 
from  a  queue. 

Only  ICCG'  failed  to  achieve  nearly  optimal  speedup,  because  the  MSOLVE  portion  of 
this  code  achieved  only  a  factor  of  2.39  speedup.  All  sections  of  the  codes  (except  the  setup 
in  DSCG,  ICCG',  and  HBDSCG)  were  parallelized,  and  all  parts  (except  the  MSOLVE  in 
ICCG')  obtained  approximately  this  factor  of  6  speedup.  The  ATIMES  routine  actually 
achieved  about  a  faaor  of  6.7  speedup,  enhanced  by  unrolling  of  DO  (DOALL)  loops.  It  is 
believed  that  a  larger  problem  size  would  result  in  somewhat  better  speedup,  up  to  about  a 
factor  of  seven.  Bus  traffic  on  the  current  Ultracomputer  Prototype  appears  to  limit  the 
speedup  of  codes  using  shared  variables  to  about  this  factor  of  seven.  There  is  a  slight  bend 
in  the  speedup  curve  for  NDCG  at  P  =  6  processors.  This  is  because  the  number  of  blocks 
(64)  is  not  a  multiple  of  the  number  of  processors  and  so  there  is  some  delay  due  to  unbal- 
anced workloads. 


#  proc. 

section 

DSCG 

ICCG' 

HBDSCG 

NDCG(|-=8) 

time 

spdup 

time 

spdup 

time 

spdup 

time 

spdup 

1 

setup 
ATIMES 
MSOLVE 

BLAS 

22 
27102 
4881 
32499 

1^50 
8343 
17162 
9822 

22 
7084 
10844 
8241 

3463 
10761 
18022 
3652 

Tola] 

64504 

37277 

26191 

35898 

2 

setup 
ATIMES 
MSOLVE 

BLAS 

23 
13759 
2514 
16898 

1 
1.97 
1.94 
1.92 

1958 
4203 
11484 
5167 

1 
1.99 
1.49 
1.90 

22 
3550 
5605 
4274 

1 
1.99 
1.93 
1.93 

1774 
5492 
9197 
1895 

1.95 
1.96 
1.96 
1.93 

Total 

33194 

1.94 

22812 

1.63 

13451 

1.95 

18358 

1.96 

4 

setup 
ATIMtS 
MSOLVE 

BLAS 

22 
7223 
1291 
9086 

1 
3.75 
3.78 
3.58 

1950 
2166 
7858 
2762 

1 
3.85 
2.18 
3.56 

22 
1845 
2989 
2292 

1 
3.84 
3.63 
3.60 

945 
2899 

4884 
1058 

3.66 
3.71 
3.69 

3.45 

Total 

17622 

3.66 

14736 

2.53 

7148 

3.66 

9786 

3.67 

6 

setup 

ATIMES 

MSOLVE 

BLAS 

22 
5026 
948 
6750 

1 
5.39 

5.15 
4.81 

1942 
1537 
7225 
2040 

1 
5.43 
2.38 

4.81 

22 

1305 
2215 
1676 

1 

5.43 
4.90 
4.92 

728 
2403 
3713 
785 

4.76 
4.49 
4.85 
4.65 

Total 

12746 

5.06 

12746 

2.92 

5218 

5.02 

7629 

4.71 

8 

setup 
ATIMES 
MSOLVE 

BLAS 

22 
4046 
745 
5769 

1 
6.70 
6.55 
5.63 

1949 
1248 
7185 
1733 

1 
6.69 
2.39 
5.67 

22 

1056 
1777 
1381 

1 
6.71 
6.10 
5.97 

551 
1692 
2368 
651 

6.28 
6.36 
6.28 
5.61 

Total 

10582 

6.10 

12115 

3.08 

4236 

6.18 

5762 

6.23 

Table  4.    Parallel  Code  Timings    (p(a:,>')  =  1.,  64x64  grid,  random  initial  guess). 
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It  was  stated  in  section  3  that  for  n  large  enough  C^n/2  »  P),  the  ICCG'  code  should 
also  achieve  almost  a  full  factor  of  P  speedup.  This  was  based  on  a  theoretical  argument, 
however,  estimating  the  amount  of  parallelism  present,  and  it  did  not  take  into  account  the 
overhead  for  accessing  queues  and  checking  on  completion  of  DOALL  loops.  The  larger  the 
problem  size  n,  the  greater  the  number  of  DOALL  loops  in  the  MSOLVE  routine,  there 
being  one  for  each  of  the  approximately  3  ^n  diagonals  of  the  grid.  Additionally,  more 
total  queue  accesses  are  required  to  grab  the  total  0{n)  loop  indices.  It  is  reasonable  to 
assume  that  the  overhead  associated  with  executing  and  terminating  DOALL's  in  the 
MSOLVE  routine  using  P  processors  is  of  the  form: 

Op  =  ap^n  +  bpn  ,      ap,  bp  independent  of  n. 

If  this  overhead  is  neglected,  then  the  time  to  perform  the  MSOLVE  operation  in 
ICCG'  can  be  estimated  based  on  the  amount  of  work  in  each  stage  of  the  computation. 
Define  a  unit  of  time  to  be  the  time  required  to  solve  a  single  equation  of  the  sparse  triangu- 
lar linear  system  (using  one  processor),  and  assume  that  ^n/2  is  a  multiple  of  the  number 
of  processors  P .  Following  the  diagonal  pattern  of  the  grid  shown  in  section  3,  one  finds  as 
in  [8]  that  the  time  required  to  perform  the  backsolve  using  P  processors  is 

tp    =    2P*1  +  2P*2  +...+  2P  *  —j-^    +    (^  +  2(P-1))  *  — B^    + 
2P  *  (     p       -  1)  +...+  2P*1 

=    f  +  ^(l-f). 
With  one  processor  the  time  is 

and  so  the  speedup  is  given  by 


t\  =  n  , 


n  1 


1  +  ^  ■ 

^  n 

Combining  the  expressions  for  t\  and  t%  with  the  expressions  for  the  overhead  Oi  and 
O^,  we  can  express  the  measured  values  t^  and  fg  in  the  form 

rg  =  agV>,   +  PgAi  . 

Measuring  ti  and  rg  for  v«  =  48,  64,  96,  and  128,  we  were  able  to  estimate  the  parame- 
ters aj.  Pi,  ag,  and  3g,  and  thereby  determine  the  size  of  problem  necessary  to  give  a 
speedup  in  the  ICCG'  MSOLVE  routine  comparable  to  that  seen  in  the  other  algorithms  for 
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Vn  =64.   The  parameters  determined  by  a  least  squares  fit  to  the  data  were 

ai  =  1.7261,     Pi=.1381,     ag  =  3.1932,     pg  =  .0205  . 

Plotted  in  Figure  2  are  the  observed  and  estimated  speedups  in  this  routine  for  problem  sizes 
ranging  up  to  ^^  =  800.  Due  to  memory  limitations  on  the  ultracomputer  prototype,  the 
largest  problem  we  were  actually  able  to  solve  was  vn  =  128.  If  the  model  is  correct,  and  if 
the  coefficients  ai,  pi,  ag,  Pg  computed  from  our  data  are  accurate,  then  this  implies  that  a 
problem  size  of  approximately  '^n  =  \\54  would  be  needed  to  obtain  a  factor  of  6  speedup 
in  this  routine,  and  a  factor  of  7  speedup  with  8  processors  would  not  be  possible.  Thus,  in 
practice,  it  may  not  be  possible  to  attain  close  to  a  full  factor  of  P  speedup  with  P  proces- 
sors, because  the  overhead  grows  with  the  problem  size.  It  should  be  pointed  out  that  these 
estimate  of  problem  size  needed  to  achieve  a  certain  speedup  are  extremely  sensitive  to  the 
parameters  ai,pi,ag,Pg  obtained  from  measured  timings.  Hence  they  should  be  regarded 
only  as  rough  estimates. 

5.  Conclusions. 

Results  of  these  experiments  show  the  hierarchical  basis  method  with  diagonal  scaling 
(HBDSCG)  to  be  particularly  efficient  on  both  serial  and  parallel  architectures.  It  should 
also  be  compared  with  other  multigrid  methods  (such  as  the  multigrid  V-cycle),  and  com- 
parisons should  be  made  on  vector  machines  as  well. 

The  problems  presented  here  are  all  relatively  easy  to  solve  on  current  serial  proces- 
sors. The  real  payoff  from  parallelism  is  likely  to  come  for  larger  3-D  problems  run  on 
machines  with  huge  memories  and  many  more  processors.  One  might  conjecture  that  if 
eight  processors  can  be  used  effectively  on  a  2-D  problem,  then  eight  hundred  could  be  used 
on  a  3-D  problem  consisting  of  one  hundred  2-D  pieces.  Of  course,  this  is  an  oversimplifi- 
cation, especially  in  the  case  of  HBDSCG,  since  its  convergence  rate  does  not  appear  as  good 
in  3-D  as  in  2-D  [16].  To  test  the  effectiveness  of  these  algorithms  on  problems  of  real 
interest,  we  need  machines  with  more  processors  and  very  large  memories. 
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Fig.  1.  Parallel  Code  Timings  (64x64  grid,  p(x,y)=l.,  random  initial  guess) 
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Fig.  2.  Measured  (x)  and  predicted  speedup  of  the  MSOLVE  routine  in 
ICCG'  using  8  processors  for  different  problem  sizes  yn". 


